Online Expectation Maximization for Reinforcement Learning in POMDPs

Authors

  • Miao Liu
  • Xuejun Liao
  • Lawrence Carin
Abstract

We present online nested expectation maximization for model-free reinforcement learning in a POMDP. The algorithm evaluates the policy only in the current learning episode, discarding the episode after the evaluation and memorizing the sufficient statistic, from which the policy is computed in closed form. As a result, the online algorithm has a time complexity of O(n) and a memory complexity of O(1), compared to O(n²) and O(n) for the corresponding batch-mode algorithm, where n is the number of learning episodes. The online algorithm, which has provable convergence, is demonstrated on five benchmark POMDP problems.
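
The complexity claim follows from the recursion sketched in the abstract: each episode contributes one E-step to a running sufficient statistic, the episode itself is then discarded, and the policy is recovered in closed form from that statistic. Below is a minimal sketch of this stochastic-approximation pattern; the class name OnlineEM, the vector-valued per-episode statistic, and the normalizing M-step are illustrative stand-ins, not the paper's actual E- and M-steps for a finite-state controller.

```python
import numpy as np

# Minimal sketch of the online sufficient-statistic recursion (illustrative;
# not the paper's actual E/M steps).

class OnlineEM:
    """O(1) memory: one running sufficient statistic, no stored episodes."""

    def __init__(self, dim):
        self.stat = np.zeros(dim)  # running sufficient statistic
        self.t = 0                 # episode counter

    def update(self, episode_stat):
        # E-step on the current episode only; the episode is then discarded.
        gamma = 1.0 / (self.t + 1)  # Robbins-Monro step size for convergence
        self.stat = (1 - gamma) * self.stat + gamma * episode_stat
        self.t += 1
        return self.m_step()

    def m_step(self):
        # Hypothetical closed-form M-step: normalize the statistic into a
        # probability vector standing in for the policy parameters.
        z = self.stat.sum()
        return self.stat / z if z > 0 else self.stat


# Usage: n episodes cost O(n) total time and O(1) memory, versus a
# batch algorithm that re-reads all stored episodes at each iteration.
em = OnlineEM(dim=4)
for _ in range(1000):
    policy = em.update(np.random.rand(4))  # stand-in per-episode statistic
```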

Related articles

Episodic Reinforcement Learning by Logistic Reward-Weighted Regression

It has been a long-standing goal in the adaptive control community to reduce the generically difficult, general reinforcement learning (RL) problem to simpler problems solvable by supervised learning. While this approach is today’s standard for value function-based methods, fewer approaches are known that apply similar reductions to policy search methods. Recently, it has been shown that immedi...
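
This snippet alludes to reducing policy search to a supervised, reward-weighted fit. A generic sketch of reward-weighted regression (omitting the logistic reward transformation that the cited paper adds) might look like the following; the linear policy form and the function name are assumptions for illustration.

```python
import numpy as np

# Generic reward-weighted regression sketch: fit a linear policy a ~ s @ W by
# weighted least squares, weighting each sample by its shifted return.
# The cited paper's logistic transformation of returns is omitted here.

def reward_weighted_regression(states, actions, returns, ridge=1e-6):
    w = returns - returns.min()                         # nonnegative weights
    sw = states * w[:, None]                            # row-weighted states
    A = sw.T @ states + ridge * np.eye(states.shape[1]) # regularized normal eqs
    B = sw.T @ actions
    return np.linalg.solve(A, B)                        # closed-form weighted fit
```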

Efficient Planning for Factored Infinite-Horizon DEC-POMDPs

Decentralized partially observable Markov decision processes (DEC-POMDPs) are used to plan policies for multiple agents that must maximize a joint reward function but do not communicate with each other. The agents act under uncertainty about each other and the environment. This planning task arises in optimization of wireless networks, and other scenarios where communication between agents is r...

Factorized Asymptotic Bayesian Policy Search for POMDPs

This paper proposes a novel direct policy search (DPS) method with model selection for partially observed Markov decision processes (POMDPs). DPSs have been standard for learning POMDPs due to their computational efficiency and natural ability to maximize total rewards. An important open challenge for the best use of DPS methods is model selection, i.e., determination of the proper dimensionali...

Expectation Maximization for Weakly Labeled Data

We call data weakly labeled if it has no exact label but rather a numerical indication of correctness of the label “guessed” by the learning algorithm a situation commonly encountered in problems of reinforcement learning. The term emphasizes similarities of our approach to the known techniques of solving unsupervised and transductive problems. In this paper we present an on-line algorithm that...

Learning for Decentralized Control of Multiagent Systems in Large, Partially-Observable Stochastic Environments

Decentralized partially observable Markov decision processes (Dec-POMDPs) provide a general framework for multiagent sequential decision-making under uncertainty. Although Dec-POMDPs are typically intractable to solve for real-world problems, recent research on macro-actions (i.e., temporally-extended actions) has significantly increased the size of problems that can be solved. However, current...


Publication year: 2013